Method for simultaneous calibration of mass spectra and identification of peptides in proteomic analysis

ABSTRACT

The invention relates to a mass spectrometry calibration system that may be performed in real-time using the information contained within a sample without the addition of specific calibrants. When applied to a sample, such as a proteomic sample, the calibration system may identify the exact masses of peptides in the sample. The system involves the use of mathematical algorithms that iteratively estimate the error in the measurement and update the calibration parameters accordingly; thereby resulting in peptide mass identification.

This application is the National Phase of International Application PCT/US06/21321, filed May 31, 2006, which designated the U.S. and that International Application was published under PCT Article 21 (2) in English. This application also includes a claim of priority under 35 U.S.C. §119(e) to U.S. provisional patent application No. 60/686,684, filed Jun. 2, 2005.

FIELD OF INVENTION

The invention relates to the calibration of mass spectra obtained in connection with proteomic analysis and to the identification of peptides in connection with the same.

BACKGROUND OF THE INVENTION

In conventional ion cyclotron resonance (“ICR”) mass spectrometers, such as those typically used in connection with Fourier Transform Mass Spectrometry (“FTMS”), charged particles are directed into a magnetic field such that the mass to charge ratio (M/Z) of the particles can be measured. In one application of this technology, as described in U.S. Pat. No. 4,959,543, which is incorporated by reference herein in its entirety, charged particles are subjected to a high voltage pulse and caused to be accelerated to larger radii of gyration relative to the particles' natural radii of gyration. Once excited in this fashion, the charged particles move in circular orbits at frequencies given by the cyclotron equation, ω=B/(M/Z) (where B is the magnetic field strength and ω is the angular frequency). The excited cyclotron motions induce transient signals on a pair of parallel electrodes positioned inside the magnet; the transient signals are a measure of the cyclotron frequency of the particles. In fact, the transient signals are actually a composite of the cyclotron frequencies of all of the ions present in the magnet. By implementing certain Fourier transform mathematics (e.g., a Fast Fourier Transform, or “FFT,” algorithm to extract the frequency and amplitude for each frequency component), these transient signals are converted into a frequency spectrum (i.e., frequency peaks corresponding to each ionic species in the instrument). In this first order model, measured frequencies are converted into M/Z through calibration values when the magnetic field strength (B) is known. There are a number of commercially available products that implement the FTMS technique; for example, Thermo, Bruker, and IonSpec all produce FTMS instruments that generally function in this manner.

As noted above, FTMS exploits the property that an ion of mass M and charge Z placed in a magnetic field of strength B undergoes orbital motion with angular frequency B/(M/Z). In a mass spectrometer, ions must be trapped by an external electrostatic field producing a slight shift in the cyclotron frequency given above. Additional frequency shifts are produced by the electrostatic field produced by the population of ions in the instrument, known as the “space-charge effect” (Gorshov. et al., Amer. Society Mass Spectrom. 4:855-868, 1991). Variations in the frequency observed for a particular ion (with fixed M/Z) can be due to fluctuations in the strength of the magnetic field, trapping voltage, or the “space-charge” effect. Of these three factors, the space-charge effect is believed to be the most difficult to control and to model. Variations in the space-charge effect are significant in liquid-chromatography mass spectrometry (LCMS), the standard technique used in analysis of proteomic samples. These variations are best corrected by active real-time calibration.

Efforts to extract accurate mass information from FTMS by mass calibration have been previously investigated. See L. K. Zhang et al., Mass Spectrometry Reviews, 24:286-309 (2005). Previous methods of FTMS mass calibration include the use of “internal” calibrants, and/or the use of “external” calibrants. In external, or “off-line” calibration, a set of standard molecules of known mass are measured by the instrument separately from the experimental sample. The differences between the measured and true masses are known with certainty, and the calibration parameters are adjusted to minimize these differences. The primary limitation of external calibration is that the calibration parameters do not remain constant from one scan to the next, largely due to the space charge effect. See E. B. Ledford, Jr. et al., Anal. Chem., 56:2744-2748 (1984).

Internal or “on-line” calibration involves the infusion of standard molecules of known mass into an experimental sample, or directly into the mass spectrometer in parallel with the sample, and measuring the mass of the standards and experimental sample in the same scan. However, the signal from the calibrant molecules may obscure a signal arising from the sample through “ion suppression”. Ion suppression occurs because the total ion capacity of an FTMS instrument is generally fixed. Therefore, the calibrant molecules are analyzed at the expense of analyte ions, reducing the measured analyte signal.

A number of methods have attempted to perform calibration without added calibrants in a process called “direct calibration”. One approach (described in M. Mann, Proceedings of the 43^(rd) ASMS Conference on Mass Spectrometry and Allied Topics, Atlanta, 1995) is based upon Mann's insight that peptide masses are confined to clusters of values spaced roughly 1 Dalton (10-100 ppm) apart throughout the spectrum (Wool et al., Proteomics, 2:1365-1373, 2002). While this method may be useful for low mass accuracy mass spectrometers (e.g., MALDI-TOF), it is not suitable for use with higher mass-accuracy systems such as FTMS. In these methods, peptides are either matched to a distribution (not identified) or only peptides that are known to be in the sample a priori are identified.

Another direct calibration method uses the known mass spacings between different charge states of the same molecule as calibration constraints (Bruce et al., JASMS 11:416-421, 2000). However, this method is unable to match the accuracy of FTMS frequency measurements. Yanofsky et al. disclose a method for an internal recalibration of an FTICR-MS analysis (Anal. Chem 77:7246-7254, 2005). However, this method is a limited approach that uses the knowledge of a particular class of proteins, and requires partial knowledge of the sample components. Direct calibration methods have also been used to identify components in wine (Cooper, H. J., and Marshall, A. G., J. Agric. Food Chem, 49:5710-5718), and petroleum products (Marshall A. G. et al., Acc. Chem. Res. 37:53-59, 2004). These methods, however, also require a priori knowledge of the masses of some of the species in the sample.

There is a need in the art for improved calibration and peptide identification techniques in connection with mass spectrometry that obviate at least some of the aforementioned limitations of currently available technology.

SUMMARY OF THE INVENTION

The invention disclosed herein relates to systems and methods useful for producing calibrated mass spectrometry spectra using components of a mass spectrometry sample as calibrants.

Embodiments of the present relate to methods of producing a calibrated mass spectrum, comprising: providing a sample comprising an elemental composition, subjecting the sample to mass spectrometry whereby a mass spectrometry output is obtained, providing input parameters, converting the mass spectrometry output to mass values using the input parameters, estimating error and elemental composition probabilities based on the mass values, updating the input parameters based on the estimated error and elemental composition probabilities, applying the updated input parameters to the mass spectrometry output to produce updated mass values, and repeating several of these steps until convergence is reached, whereby a calibrated mass spectrum is produced.

Further embodiments of the present invention relate to methods wherein the input parameters are selected from the group consisting of a mass database, initial calibration parameters, an initial error estimate, updated calibration parameters, an updated error estimate, and combinations thereof.

Still further embodiments of the present invention relate to methods wherein the mass spectrometry is Fourier transform mass spectrometry.

Other embodiments of the present invention relate to methods wherein the mass spectrometry output comprises cyclotron frequencies, and wherein the elemental composition probabilities are peptide probabilities.

Additional embodiments of the present invention relate to methods wherein the sample is selected from the group consisting of blood, plasma, serum, spinal fluid, urine, sweat, saliva, tears, breast aspirate, prostate fluid, seminal fluid, vaginal fluid, stool, cervical scraping, cytes, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysates, tumor tissue, hair, skin, buccal scrapings, nails, bone marrow, cartilage, prions, bone powder, ear wax, and combinations thereof.

Alternative embodiments of the present invention relate to methods wherein the elemental composition comprises at least one peptide.

Other embodiments of the present invention relate to methods wherein the sample is selected from the group consisting of hydrocarbons, petroleum products, nucleotides, combinatorial samples, polymeric samples, and combinations thereof.

Other embodiments of the present invention relate to methods wherein the sample is a petroleum product.

Other embodiments of the present invention relate to methods wherein the estimating the error and elemental composition probabilities comprises using an Expectation Minimization algorithm and/or using a spline algorithm.

Embodiments of the present invention relate to mass spectrometry calibration systems, comprising a mass spectrometry device to analyze a sample and produce a mass spectrometry output, and calibration software configured to receive input parameters, convert the mass spectrometry output to mass values using the input parameters, estimate error and elemental composition probabilities based on the mass values, update input parameters based on the estimated error and elemental composition probabilities, apply the updated input parameters to the mass spectrometry output to produce updated mass values, and repeat several of these steps until convergence is reached, whereby a calibrated mass spectrum is produced.

Further embodiments of the present invention relate to mass spectrometry calibration systems wherein the input parameters are selected from the group consisting of a mass database, initial calibration parameters, an initial error estimate, updated calibration parameters, an updated error estimate, and combinations thereof.

Still further embodiments of the present invention relate to mass spectrometry calibration systems wherein the mass spectrometry device is a Fourier transform mass spectrometer.

Other embodiments of the present invention relate to mass spectrometry calibration systems wherein the mass spectrometry output comprises cyclotron frequencies, and wherein the elemental composition probabilities are peptide probabilities.

Further embodiments of the present invention relate to mass spectrometry calibration systems wherein the sample is selected from the group consisting of blood, plasma, serum, spinal fluid, urine, sweat, saliva, tears, breast aspirate, prostate fluid, seminal fluid, vaginal fluid, stool, cervical scraping, cytes, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysates, tumor tissue, hair, skin, buccal scrapings, nails, bone marrow, cartilage, prions, bone powder, ear wax, and combinations thereof.

Still further embodiments of the present invention relate to mass spectrometry calibration systems wherein the sample comprises at least one peptide.

Additional embodiments of the present invention relate to mass spectrometry calibration systems wherein the sample is selected from the group consisting of hydrocarbons, petroleum products, nucleotides, combinatorial samples, polymeric samples, and combinations thereof.

Other embodiments of the present invention relate to mass spectrometry calibration systems wherein the sample is a petroleum product.

Further embodiments of the present invention relate to mass spectrometry calibration systems wherein the software is configured to estimate the error and the elemental composition probabilities using an Expectation Minimization algorithm, and/or using a spline algorithm.

Embodiments of the present invention also relate to a computer-readable medium having computer-executable instructions that when executed perform a method, the method comprising converting a mass spectrometry output to mass values using input parameters, estimating error and elemental composition probabilities based on the mass values, updating the input parameters based on the estimated error and elemental composition probabilities, applying the updated input parameters to the mass spectrometry output to produce updated mass values, and repeating several of these steps until convergence is reached, whereby a calibrated mass spectrum is produced.

Further embodiments of the present invention relate to computer-readable media wherein the input parameters are selected from the group consisting of a mass database, initial calibration parameters, an initial error estimate, and combinations thereof.

Still further embodiments of the present invention relate to computer-readable media wherein the estimating the error and the elemental composition probabilities uses an Expectation Minimization algorithm and/or a spline algorithm.

Other embodiments of the present invention relate to computer-readable media wherein the mass spectrometry output is produced by a Fourier transform mass spectrometer.

Additional embodiments of the present invention relate to computer-readable media wherein the mass spectrometry output comprises cyclotron frequencies.

Further embodiments of the present invention relate to computer-readable media wherein the elemental composition probabilities are peptide probabilities.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 depicts a flow chart, illustrating a method of simultaneous calibration of mass spectra and elemental composition identification in accordance with an embodiment of the present invention.

FIG. 2A shows a distribution of peptide masses in the human proteome in accordance with an embodiment of the present invention.

FIG. 2B is an inset of FIG. 2A in accordance with an embodiment of the present invention. It shows nominal mass clusters near 1,000 Da.

FIG. 2C is an inset of FIG. 2B in accordance with an embodiment of the present invention. The panel shows five individual peptide masses designated by the peak numbers A through E.

FIG. 3A shows the estimation of frequencies from a mass spectrum in accordance with an embodiment of the present invention.

FIG. 3B shows a graph depicting the conversion of frequencies to masses by estimating calibration parameters in accordance with an embodiment of the present invention.

FIG. 4 shows a more detailed overview of the calibration process in accordance with an embodiment of the present invention.

FIG. 5 shows the results of a calibration test in accordance with an embodiment of the present invention.

DESCRIPTION OF THE INVENTION

Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. One skilled in the art will recognize many methods and materials similar or equivalent to those described herein, which could be used in the practice of the present invention. Indeed, the present invention is in no way limited to the methods and materials described.

Embodiments of the present invention relate to systems and methods for calibration and peptide identification in connection with mass spectrometry; in particular, with FTMS. Furthermore, the present invention exploits the natural relationship between peptide identification and calibration to solve two related problems simultaneously, and to iteratively improve the solutions for each. Most conventional calibration methods require calibrant molecules of known mass to be added to a sample. The present invention, however, is based upon an iterative process of identifying components in the sample and using these identified components as calibrants.

While preferred embodiments of the inventive systems and methods relate to peptide calibration, they may readily be applied to other types of chemicals or compounds. As used herein, the general term “elemental composition” includes all types of compounds, including peptides, that may be analyzed using the systems and methods disclosed herein.

Most calibration methods in current use require the addition of calibrant molecules of known mass into a sample. Alternatively, the inventive direct calibration methods use the components of the sample alone to provide dozens of calibrants covering the entire mass spectrum. Direct calibration methods save time and materials, simplify the experimental apparatus and protocol, perform calibration in real time each time a spectrum is generated, avoid obscuration of information that can result from ion suppression, resulting in significant improvements in accuracy. The higher mass accuracy of FTMS systems allow the identification of elemental compositions from a large pool of candidates, for example, human tryptic peptides or petroleum components. Increased calibration accuracy results from the ability to use more species in the calibration and the positive feedback between identification and calibration.

FIG. 1 shows a general overview of the calibration system (100). First, a sample may be analyzed by mass spectrometry to produce a mass spectrometry output (101). For example, with FTMS, the mass spectrometry output comprises cyclotron frequencies. The mass spectrometry output, along with other initial input parameters (102), such as a mass database (ENSEMBL, for example), calibration parameters, and error estimates may be used to convert the mass spectrometry output to mass values (103). The error as well as the probabilities for the elemental compositions may then be estimated (104), and the calibration parameters may be updated (105). The updated calibration parameters may then be used to again convert mass spectrometry output to mass values. Steps 103 through 105 may repeated any number of times until the data reach convergence. The converged data, or converged calibration output, may then be stored or displayed in any suitable computer-readable or printed format (106). In certain embodiments of the invention, the output of the mass spectrometry calibration system is a calibrated mass spectrum.

In accordance with an embodiment of the present invention, calibration may be performed in real-time using the information contained in a sample without the addition of specific calibrants. A sample comprising peptides, for example, a proteomic sample, may be subjected to a mass spectrometry, for example, FTMS, using instruments and methods that are well known in the art. As shown in FIGS. 2A through 2C, Individual human tryptic peptide masses may be resolved at around 1 ppm accuracy. Table 1 shows for example, the number of peptide mass values that may be analyzed. FIG. 2A shows the entire distribution of mass values in the human proteome. FIG. 2B is an inset of the region of FIG. 2A (inset region designated by the rectangular bar). This figure shows the nominal mass clusters near 1000 Da. FIG. 2C is an inset of the region of FIG. 2B (inset region designated by the rectangular bar). This figure shows five individual peptide masses. The box below the graph designates the mass for peaks A through E in the figure.

TABLE 1 human protein sequences 50,071 (as provided by IPI, ENSEMBL) ideal tryptic peptides 2,515,788 distinct sequences 808,076 distinct masses 356,933

In FTMS, an ionized peptide's mass-to-charge ratio is estimated by estimating the frequency of its circular motion induced by a centripetal magnetic force. The ion induces an image charge, or transient voltage signal, on either of two parallel detection plates as it passes. The observed frequency is calculated from a peak in the Fourier transform of the transient voltage between the plates.

The “observed” mass is derived in a two-step process; 1) extraction of ion frequencies, and 2) conversion of frequencies to mass by calibration. As shown in FIG. 3, calibration of the FT mass spectrometer is the process by which each observed frequency (a peak in a spectrum) is converted into a mass-to-charge value. In FTMS, the measured quantity is frequency, and mass “measurements” are derived from frequencies. Calibration may be thought of as an optimization problem: given a family of calibration equations such that there is a one-to-one correspondence with vectors of real-valued parameters, choose an equation (or equivalently parameter values) that minimizes a cost function. In this case, the cost function is the estimated variance of the normalized error.

FIG. 4 shows the calibration process for FTMS in more detail. Table 2 shows the definitions of the symbols used in FIG. 4. Box 401 comprises the input parameters. The input parameters include M, which denotes a peptide mass database, A⁽⁰⁾ and B⁽⁰⁾ the initial calibration parameters, f, the observed frequencies from the mass spectrometer, and σ⁽⁰⁾, the initial error estimate. A⁽⁰⁾, B⁽⁰⁾, and σ⁽⁰⁾ are only used in the first iteration. The values A⁽⁰⁾ and B⁽⁰⁾ are used to convert the observed frequencies to mass values (402). The value σ⁽⁰⁾ is used to calculate initial peptide mass distributions.

TABLE 2 Symbol Definition f = (f₁ . . . f_(n)) observed frequencies M = (M₁ . . . M_(N)) peptide mass database A^((k)), B^((k)) calibration parameters σ^((k)) error estimate m^((k)) = (m₁ ^((k)) . . . m_(n) ^((k))) calibrated mass p^((k)) = [p_(ij) ^((k))]_(i = [1 . . . n],j = [1 . . . N]) probability matrix p_(ij) probability that frequency I (came from mass M_(i))

The mass values are then subjected to an iterative process wherein a mathematical algorithm, such as the Expectation Maximization (EM) algorithm is applied, allowing for the estimation of error in the probabilities that are assigned to the mass values (403). A comprehensive description of the EM algorithm is provided in a publication by Dempster et al. (J. Royal Statistical Society B, 39:1-38, 1977), which is incorporated herein by reference in its entirety. The use of the EM algorithm for calibration is described in the Examples. The revised error estimates allow for the calculation of updated calibration parameters (404), A(^(k)) and B(^(k)). These calibration parameters are then re-applied to the mass values. The processes designated by boxes 402 through 404 are repeated until the updated calibration parameters no longer change from the values in the subsequent iterations. This stage is referred to as “convergence” (405).

In general, the frequency is inserted into a calibration equation to obtain the mass-to-charge ratio of the ionized peptide. The calibration equation has a set of parameters whose values are taken to be fixed in the initial step of the calculation. Subsequently, the calibration parameters are tuned to minimize the estimated normalized error.

The second step is to estimate the charge on the peptide by examining the positions of adjacent peaks that are presumed to be species with identical elemental composition and charge, differing only in isotopic composition. Since these mass differences between isotopes are approximately one atomic mass unit, a peptide with charge z would produce a set of peaks with uniform peaks separated by 1/z units in mass-to-charge.

To first order, the mass-to-charge ratio is linearly proportional to the period of the ion's revolution; the constant of proportionality is the magnitude of the magnetic field. The very high accuracy of the FTMS, however, exposes systematic errors in the simple first-order model. Higher-order effects depend upon the geometry of the analytic chamber and the “space-charge effect”—interactions between multiple ionic species present within the chamber. A term that depends upon the square of the period is commonly used to account for these effects. A review by Zhang et al. describes some of the development of these models (Mass Spectrometry Reviews 24:286-309, 2005).

For example, a collection of peptide mass measurements and a database of exact peptide mass values may be provided. There are several databases comprising exact peptide mass values that are known in the art. For example, the ENSEMBL database (Hubbard T. et al., Nucleic Acids Res 33:D447-D453, 2005) and the European Bioinformatics Institute (EBI) both provide comprehensive lists of peptides and peptide masses. Alternatively, the calculated masses of an “in silico” tryptic digest of a proteome, for example, the human proteome, may be used as a peptide mass database. For elemental compositions other than peptides, such as petroleum products, polymers, or combinatorial libraries, alternative mass databases may be used that are apparent to those of skill in the art.

The calibration process proceeds iteratively. At each step, the calibration parameters are updated to minimize the variance of the normalized error using the current estimate of the probability mass distribution for the exact mass identity (elemental composition, e.g., peptide). The updated calibration parameters change the mass values that are computed from the observed frequencies. These new values will result in a new (initial) estimate for the normalized error variance. This initial estimate will be refined by the EM algorithm, resulting in a updated estimate of the normalized error variance and a new set of probability mass distributions for the exact mass identity of each measurement. This procedure of iterating calibration steps and applications of the EM algorithm to update the exact mass probabilities is repeated to convergence. The term “convergence,” as used herein occurs when subsequent iterations result in essentially the same values of the calibration parameters A and B. An example of this process is shown in Example 4.

The calibration system disclosed herein may be used with a number of different mass spectrometry systems and configurations that are known in the art. While an embodiment involves the use of the calibration system with FTMS, it may also be used with other types of mass spectrometry such as time-of-flight (TOF) mass spectrometry, given that the mass accuracy is sufficient.

The calibration system disclosed herein may be used on a variety of different sample types. In a preferred embodiment, the calibration system is used with samples comprising peptides in a biological sample. For example, a proteomic sample may be analyzed. A wide array of biological samples may be obtained and used in conjunction with alternate embodiments of the system (e.g., a body fluid, such as blood, plasma, serum, CSF (spinal fluid), urine, sweat, saliva, tears, breast aspirate, prostate fluid, seminal fluid, vaginal fluid, stool, cervical scraping, cytes, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysates, tumor tissue, hair, skin, buccal scrapings, nails, bone marrow, cartilage, prions, bone powder, ear wax, etc.). In addition, non-mammalian biological samples may be analyzed using the systems and methods disclosed herein. For example, samples of elemental compositions obtained from plants, bacteria, fungi, soil, and water may be analyzed.

In addition to biological samples comprising peptides, the calibration systems and methods disclosed herein may be used to analyze any number of different types of samples that will be readily apparent to those of skill in the art. Other examples of chemical compounds or elemental compositions that may be analyzed in this manner include but are by no means limited to polynucleotides, hydrocarbon or petroleum products, combinatorial libraries, and polymeric samples. Further, the calibration system may also be used to analyze the compounds or elemental compositions present in liquids such wine or other beverages. The calibration method requires that most components belong to a finite, but large set of possible elemental compositions. The size of this set can be as large as 10⁵-10⁶, and is limited only by the accuracy of the MS instrument.

For peptide applications of the calibration system, samples may be prepared using any suitable method. Many such methods are known in the art. For example, a proteomic sample may be digested with a protease such as trypsin to produce smaller peptides. Prior to introduction into the mass spectrometer, the peptides may be fractionated by a variety of methods, including chromatographic methods such as reverse-phase, size exclusion, or ion exchange chromatography, or by electrophoretic methods such as SDS-PAGE.

The mass spectrometry calibration system disclosed herein generally comprises “calibration software” that facilitates the mathematical calculations necessary for calibration. The calibration software may be stored as machine readable code on a computer that may be in communication with the mass spectrometry system. Alternatively, the calibration system may be applied to the output of a mass spectrometer separately from the mass spectrometry system. The software may be stored on any suitable computational device. For example, the software as well as the means for its execution may be integrated with the mass spectrometry instrument, or housed separately on a computer or any type of suitable electronic storage device. Examples include but are no means limited to hard disks or drives, CD-ROMs, DVDs, and removable storage devices such as USB drives and flash drives. Nearly any hardware, firmware, software, operating system, database platform, networking technique or other conventional computer tool can be configured to operate in connection with the system and methods of the present invention, as will be appreciated by those of skill in the art.

In an alternative embodiment of the invention, an algorithm is utilized that finds a spline curve (continuous in first derivative) that minimizes the weighted squared distance to identified masses. The use of spline in a high-order, locally deformable calibration model to fit a large number of calibrants is believed to be one of the novel features of the instant invention. The spline may be constructed from segments of the form M/Z=A/f+B/f²+C. The weight associated with each calibrant point reflects the probability that a given mass has been identified correctly. Each spline segment may contain at least N points (e.g., N=10, N=20, etc.) to prevent overfitting. Indeed, generally speaking, the estimation of calibration (spline) parameters is the solution to a constrained optimization problem. The solution is the point where the vector normal to the constraint space (sets of parameters which are valid splines—i.e., smooth curves) is parallel to the gradient of the objective function (i.e., the sum of squared differences between observed and calculated mass values). Example 6 demonstrates how a spline algorithm may be used in the calibration process.

EXAMPLE 1 Assessment of a Peptide's Exact Mass from a Mass Measurement with Known Error

In this Example, the mass of a peptide is measured, and the measured mass is denoted as β. To make an inference about the true mass of the peptide from the measured value, a quantitative model of the measurement process is needed. The measurement of a peptide with mass a can be modeled as the sum of the true mass α plus an error term, e.

The error term, denoted by “e”, is a normally distributed random variable with mean zero and variance σ². The conditional probability density, p(β|α), evaluated at β is given below.

$\begin{matrix} {{p\left( \beta \middle| \alpha \right)} = {\left( {2{\pi\sigma}^{2}} \right)^{{- 1}/2}{\exp\left( \frac{- \left( {\beta - \alpha} \right)^{2}}{2\sigma^{2}} \right)}}} & (1) \end{matrix}$

For the purposes of this example, a database of all possible exact mass values may be provided, and the set of these values may be denoted by {α₁, α₂ . . . α_(r)}. Peptide exact mass assessment involves assigning probabilities to the possible mass values, p(α_(j)|β), j [1 . . . r], given the measured value β. These probabilities may be computed in terms of our measurement model and Bayes' Law.

$\begin{matrix} {{p\left( \alpha_{j} \middle| \beta \right)} = \frac{{p\left( \alpha_{j} \right)}{p\left( \beta \middle| \alpha_{j} \right)}}{\sum\limits_{j = 1}^{r}{{p\left( \alpha_{j} \right)}{p\left( \beta \middle| \alpha_{j} \right)}}}} & (2) \end{matrix}$

The factor p(α_(φ)) in the above equation denotes the a priori (before measurement) probability that the peptide has mass α_(j). If there is no a priori information about the peptide mass values, p(α_(j))=1/r, for all j in [1 . . . r]. For example, it is possible to assign theoretical a priori probabilities to peptide elemental compositions.

Although the above equation assigns non-zero probability to all possible mass values, the probability assigned to values differing from β by more than 5σ is quite small and can be neglected. In some cases, only one exact mass value will have significant probability.

EXAMPLE 2 Estimation of Mass Measurement Error Variance from Measurements of Known Peptides

A related calculation is the estimation of the variance of the mass measurement error e from a collection of measurements of peptides of known masses. For example, in this case, one may have q peptides with masses α_(m(1)), α_(m(2)), . . . α_(m(q)) respectively. Each peptide in sequence may be measured resulting in measured values β₁, β₂, . . . β_(q) respectively. That is, for each i from 1 to q, β_(i) is the measured value of the ith peptide, whose true mass is α_(m(i)).

If it is known that when measurement errors are independent and identically distributed normal random variables with mean zero, the maximum likelihood estimate of the variance of the error may be computed. Let σ² denote the (unknown) variance of the error. The probability density for the measured value of a peptide with mass α_(m(i)), evaluated at the value β₁ is given by Equation 1.

Let N-component vectors α and β denote the ordered collections of true and measured masses respectively. Then the probability density for the entire set of measured values, evaluated at b, is given by Equation 3

$\begin{matrix} {{p\left( {\left. \beta \middle| \alpha \right.,\sigma^{2}} \right)} = {{\left( {2{\pi\sigma}^{2}} \right)^{{- q}/2}{\prod\limits_{i = 1}^{q}{\exp\left( \frac{- \left( {\beta_{i} - \alpha_{m{(i)}}} \right)^{2}}{2\sigma^{2}} \right)}}} = {\left( {2{\pi\sigma}^{2}} \right)^{{- q}/2}{\exp\left( \frac{- {{\beta - \alpha}}^{2}}{2\sigma^{2}} \right)}}}} & (3) \end{matrix}$

where ∥β−α∥² denotes the squared Euclidean distance between β and α, that is, the sum of the squared component differences.

Let {circumflex over (σ)}² denote the maximum-likelihood estimate of the error variance, the value of σ² that maximizes the right-hand side of Equation 3. It is equivalent and more convenient, to maximize the logarithm of this quantity. First, the first-derivative is evaluated with respect to σ².

$\begin{matrix} {{\frac{\mathbb{d}}{\mathbb{d}\sigma^{2}}\log\;{p\left( {\left. \beta \middle| \alpha \right.,\sigma^{2}} \right)}} = {{\frac{\mathbb{d}}{\mathbb{d}\sigma^{2}}{\log\left( {{{- \frac{q}{2}}{\log\left( {2{\pi\sigma}^{2}} \right)}} - \frac{- {{\beta - \alpha}}^{2}}{2\sigma^{2}}} \right)}} = {{- \frac{q}{2\sigma^{2}}} + \frac{- {{\beta - \alpha}}^{2}}{2\left( \sigma^{2} \right)^{2}}}}} & (4) \end{matrix}$

The log-likelihood has zero first-derivative at {circumflex over (σ)}², and its value is determined as shown in Equation 5.

$\begin{matrix} {{{\left. {\frac{\mathbb{d}}{\mathbb{d}\sigma^{2}}\log\;{p\left( {\left. \beta \middle| \alpha \right.,\sigma^{2}} \right)}} \right|_{\sigma^{2} = {\hat{\sigma}}^{2}} = 0}{\hat{\sigma}}^{2} = {\frac{{{\beta - \alpha}}^{2}}{q} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\left( {\beta_{i} - \alpha_{m{(i)}}} \right)^{2}}}}}\;} & (5) \end{matrix}$

The maximum-likelihood estimate of the variance is simply the mean of the squared difference between measured and true values.

In mass spectrometry, the average magnitude of the error, for repeated measurements of the same peptide, is linearly proportional to the mass of the measured peptide. Furthermore, the measurement accuracy of a mass spectrometry is characterized by the average magnitude of the error expressed in parts per million (ppm) of the measured mass. For example, a peptide of mass α is measured and the resulting measurement error is e. That is, the measured value is α+e. Let e′ denote the normalized measurement error (expressed in ppm) defined by Equation 6.

$\begin{matrix} {e^{\prime} = {10^{6}\frac{e}{\alpha}}} & (6) \end{matrix}$

Let (σ′)² denote the variance of the normalized error. Let ({circumflex over (σ)}′)² denote the maximum-likelihood estimate of this quantity. The estimation of the normalized error variance is similar to that of the unnormalized error variance and given by Equation 7.

$\begin{matrix} {\left( {\hat{\sigma}}^{\prime} \right)^{2} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\left( \frac{\beta_{i} - \alpha_{m{(i)}}}{10^{- 6}\alpha_{m{(i)}}} \right)^{2}}}} & (7) \end{matrix}$

EXAMPLE 3 Estimation of Measurement Error from Measurements of Unidentified Peptides

In the previous two examples, it was demonstrated 1) how to assess a peptide's exact mass from a mass measurement when the measurement error is known and 2) how to estimate the measurement error from a collection of known peptides. In this Example, the maximum likelihood estimate of the normalized measurement error variance from measurements of unidentified peptides will be derived. This solution will be interpreted in terms of the solutions of the problems in Examples 1 and 2.

In this Example, one has a database of all possible exact mass values denoted by a=(α₁, α₂, . . . α_(r)) and a collection of mutually independently measured peptide masses b=(β₁, β₂, . . . β_(q)). There exists a mapping m: [1 . . . q]→[1 . . . r] such that for each i in [1 . . . q], measured value β_(i) resulted from measuring a peptide with mass α_(m(i)). If this mapping were known, it would be possible to estimate the normalized error variance directly as described in the Example 2. In this sense, the quantities {α, β, m} form a complete data set. Let ({circumflex over (σ)}′)²|α,β,m denote the estimate of (σ′)² given α, β, and m. Instead the mapping m may be inferred (or better, averaged over possible realizations of m) to estimate (σ′)² for the incomplete data set {α, β}.

One possible method for constructing this estimate would be to start with an initial (incorrect) estimate of (σ′)². Let └({circumflex over (σ)}′)²┘₀ denote this initial estimate. Then, assuming that the error parameter is actually └({circumflex over (σ)}′)²┘₀, for each measurement β_(i), calculate the probability that the exact mass value is aj. These probabilities p(α_(j)|β_(i), └({circumflex over (σ)}′)²┘₀) are computed substituting β_(i) for β in Equation 2 and (10⁶α_(j))²└({circumflex over (σ)}′)²┘₀ for σ² in Equation 1.

Then, the updated estimate of the measurement error is the weighted average over each pair of measurements and possible exact mass value (β_(i), α_(j)). The weights are the probabilities p(α_(j)|β_(i),└({circumflex over (σ)}′)²┘₀) computed above. In general, if ({circumflex over (σ)}′)_(n) ² denotes the estimated variance after n iterations, the subsequent estimate ({circumflex over (σ)}′)_(n+1) ² is given by Equation 8.

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}{\sum\limits_{j = 1}^{r}{\left( \frac{\beta_{i} - \alpha_{j}}{10^{- 6}\alpha_{j}} \right)^{2}{p\left( {\left. \alpha_{j} \middle| \beta_{i} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}}}} & (8) \end{matrix}$

Like Equation 7, Equation 8 is the average of the observed deviations between the measured and exact mass. In Equation 8, each possible exact mass value is weighted by its conditional probability given the measured value β_(i) and the previous estimate of the normalized error variance, └({circumflex over (σ)}′)²┘_(n). These probabilities are computed as shown in Equation 2. Equation 8 reduces to Equation 7 if p(α_(j)|β_(i), └({circumflex over (σ)}′)²┘_(n)) is set equal to δ_(ij), i.e. with probability one, the exact mass corresponding to measurement β_(i) is α_(i).

The formal derivation of Equation 8 using the EM algorithm is given in Example 5.

Starting from an initial estimate of the normalized error variance (e.g. └({circumflex over (σ)}′)²┘₀=1), Equation 8 is recalculated repeatedly until the estimate converges. This process is guaranteed to converge to the maximum likelihood estimate of the normalized error variance, as it is a realization of the generalized Expectation-Maximization (EM) algorithm.

Each step of the EM algorithm averages over all possible “completions” of the data, in this case, all possible peptide identifications. As the algorithm converges to a stable estimate of the error, it also produces increasingly accurate probabilistic peptide identifications.

EXAMPLE 4 Calibration of Fourier-Transform Mass Spectra A Two-Parameter Calibration from a Spectrum of Unknown Peptide

A set of frequencies (f^(obs) ₁, f^(obs) ₂, f^(obs) _(q)) corresponding to the cyclotron motion of the monoisotopic species of a peptide may be extracted from the spectrum. It is also assumed that the charges of the peptides may also be determined unambiguously from the sequence of frequencies of isotopically related species. Let (z₁, z₂, . . . z_(q)) denote the corresponding charges.

Let A and B denote undetermined calibration parameters in the following functional form relating observed frequencies to mass-over-charge ratio:

$\left( \frac{m}{z} \right)^{obs} = {{A\frac{1}{f^{obs}}} + {B\frac{1}{\left( f^{obs} \right)^{2}}}}$

Solving for the mass, the related equation below is obtained:

$m^{obs} = {z\left( {{A\frac{1}{f^{obs}}} + {B\frac{1}{\left( f^{obs} \right)^{2}}}} \right)}$

The calibration problem involves finding values A* and B* that minimize the estimated average squared (normalized) difference between the true value of the mass and the value calculated from the observed frequency, the charge, and the calibration parameters as in the above equation.

It will be shown that the values of A* and B* may be determined by solving two linear equations in two unknowns.

It is assumed that the possible exact mass values are given by {a1, a2, . . . ar}. The expected squared error is given in Equation 8 where bi is replaced by m_(i) ^(obs). In addition, the probabilities assigned to the exact mass values will be taken as fixed. As a shorthand notion, let p_(ij) represent the quantity p(α_(j)|m_(i) ^(obs),({circumflex over (σ)}′)²).

Equation 8 is re-written in this new notation.

${\hat{\sigma}}^{2} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{m_{i}^{obs} - \alpha_{j}}{10^{- 6}\alpha_{j}} \right)^{2}p_{ij}}}}}$

Then, m_(i) ^(obs) is replaced with the calibration formula.

${\hat{\sigma}}^{2} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{{z_{i}\left( {{A\frac{1}{f_{i}^{obs}}} + {B\frac{1}{\left( f_{i}^{obs} \right)^{2}}}} \right)} - \alpha_{j}}{10^{- 6}\alpha_{j}} \right)^{2}p_{ij}}}}}$

Now both sides are differentiated with respect to each calibration parameter.

$\frac{\partial\left( {\hat{\sigma}}^{2} \right)}{\partial A} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{{z_{i}\left( {{A\frac{1}{f_{i}^{obs}}} + {B\frac{1}{\left( f_{i}^{obs} \right)^{2}}}} \right)} - \alpha_{j}}{10^{- 6}\alpha_{j}} \right)\left( \frac{z_{i}}{f_{i}^{obs}10^{- 6}\alpha_{j}} \right)p_{ij}}}}}$ $\frac{\partial\left( {\hat{\sigma}}^{2} \right)}{\partial B} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{{z_{i}\left( {{A\frac{1}{f_{i}^{obs}}} + {B\frac{1}{\left( f_{i}^{obs} \right)^{2}}}} \right)} - \alpha_{j}}{10^{- 6}\alpha_{j}} \right)\left( \frac{z_{i}}{\left( f_{i}^{obs} \right)^{2}10^{- 6}\alpha_{j}} \right)p_{ij}}}}}$

When the above derivatives are evaluated at (A*,B*), each is equal to zero, since (A*,B*) minimizes {circumflex over (σ)}².

${{\frac{A^{*}}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{2}} \right)\left( \frac{1}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)p_{ij}}}}} + {\frac{B^{*}}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}} \right)\left( \frac{1}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)p_{ij}}}}}} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{\alpha_{j}}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)\left( \frac{z_{i}}{f_{i}^{obs}} \right)p_{ij}}}}}$ ${{\frac{A^{*}}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}} \right)\left( \frac{1}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)p_{ij}}}}} + {\frac{B^{*}}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{4}} \right)\left( \frac{1}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)p_{ij}}}}}} = {\frac{1}{q}{\sum\limits_{i = 1}^{q}\;{\sum\limits_{j = 1}^{r}\;{\left( \frac{\alpha_{j}}{\left( {10^{- 6}\alpha_{j}} \right)^{2}} \right)\left( \frac{z_{i}}{\left( f_{i}^{obs} \right)^{2}} \right)p_{ij}}}}}$

The two equations above are re-written as a single matrix equation.

$\begin{bmatrix} {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{2}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} & {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} \\ {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} & {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{4}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} \end{bmatrix}{\quad{\begin{bmatrix} A^{*} \\ B^{*} \end{bmatrix} = \begin{bmatrix} {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}}{f_{i}^{obs}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}}}}} \\ {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}}{\left( f_{i}^{obs} \right)^{2}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}}}}} \end{bmatrix}}}$

Finally, the optimal values of the calibration parameters may be solved.

$\begin{bmatrix} A^{*} \\ B^{*} \end{bmatrix} = {\begin{bmatrix} {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{2}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} & {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} \\ {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{3}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} & {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}^{2}}{\left( f_{i}^{obs} \right)^{4}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}^{2}}}}} \end{bmatrix}^{- 1}{\quad\begin{bmatrix} {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}}{f_{i}^{obs}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}}}}} \\ {\sum\limits_{i = 1}^{q}\;{\frac{z_{i}}{\left( f_{i}^{obs} \right)^{2}}{\sum\limits_{j = 1}^{r}\;\frac{p_{ij}}{\alpha_{j}}}}} \end{bmatrix}}}$

After the new values A* and B* have been used to recalculate the observed masses, m_(i) ^(obs), the error estimate may be reduced. As a result, the probabilities assigned to the exact masses for each measurement p_(ij) shift so that more weight is placed upon candidates that are close to the calculated mass value. The EM algorithm may be run again to simultaneously determine the overall error and the individual probabilities. After the probabilities are updated, the values of A* and B* that have just been calculated are no longer optimal and may be recalculated. This procedure of iterating calibration steps and applications of the EM algorithm to update the exact mass probabilities is repeated to convergence.

EXAMPLE 5 Derivation of the Update Step in the Application of the EM Algorithm

By definition of the EM algorithm, the estimate of the normalized error variance in step n+1, └({circumflex over (σ)}′)²┘_(n+1), is the value that maximizes the function Q (the expectation) calculated from the estimate obtained in step n, └({circumflex over (σ)}′)²┘_(n).

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\begin{matrix} {\arg\mspace{11mu}\max} \\ {\left( \sigma^{\prime} \right)^{2} \in {R +}} \end{matrix}{Q\left( \left( \sigma^{\prime} \right)^{2} \middle| \left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n} \right)}}} & (9) \end{matrix}$

The function Q is defined as the expectation of the log-likelihood of the complete data given the undetermined normalized error variance, (σ′)². The complete data is the set of observed measurements β plus the exact masses of the measured peptides, denoted by the mapping m. The possible completions of the data, the exact peptide masses, are considered to be drawn from the conditional distribution given the measurements β with the normalized error variance taken to be └({circumflex over (σ)}′)²┘_(n).

$\begin{matrix} {{Q\left( \left( \sigma^{\prime} \right)^{2} \middle| \left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n} \right)} = {{E\left\lbrack {\left. {\log\;{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)}} \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right\rbrack} = {\sum\limits_{m \in {\lbrack{1\mspace{11mu}\ldots\mspace{11mu} r}\rbrack}^{q}}{\log\;{{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)} \cdot {p\left( {\left. m \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}}}} & (10) \end{matrix}$

The value of (σ′)² that maximizes Q has zero first-derivative. The first derivative of Q is given by Equation 11.

$\begin{matrix} {\frac{\partial{Q\left( \left( \sigma^{\prime} \right)^{2} \middle| \left\lfloor \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rfloor_{n} \right)}}{\partial\left( \sigma^{\prime} \right)^{2}} = {\sum\limits_{m \in {\lbrack{1\mspace{11mu}\ldots\mspace{11mu} r}\rbrack}^{q}}{\frac{{\partial\log}\;{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)}}{\partial\left( \sigma^{\prime} \right)^{2}} \cdot {p\left( {\left. m \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}} & (11) \end{matrix}$

The probability of the complete data, which appears in the right hand side of Equation 11, can be expressed as a product of probabilities. These factors are expressed in terms of individual measurements in Equations 13 and 14.

$\begin{matrix} {{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)} = {{p\left( {\left. \beta \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2},m} \right)}{p(m)}}} & (12) \\ {{p\left( {\left. \beta \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2},m} \right)} = {\prod\limits_{i = 1}^{q}\;{p\left( {\left. \beta_{i} \middle| \alpha_{m_{i}} \right.,\left( \sigma^{\prime} \right)^{2}} \right)}}} & (13) \\ {{p(m)} = {\prod\limits_{i = 1}^{q}\;{p\left( \alpha_{m_{i}} \right)}}} & (14) \end{matrix}$

The log-likelihood of the complete data, which appears in the right-hand side of Equation 11, can be expressed as a sum of terms by combining equations 12, 13, and 14.

$\begin{matrix} {{\log\;{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)}} = {{{\sum\limits_{i = 1}^{q}\;{\log\;{p\left( {\left. \beta_{i} \middle| \alpha_{m_{i}} \right.,\left( \sigma^{\prime} \right)^{2}} \right)}}} + {\sum\limits_{i = 1}^{q}\;{\log\;{p\left( \alpha_{m_{i}} \right)}}}} = {{\frac{- 1}{2\left( \sigma^{\prime} \right)^{2}}{\sum\limits_{i = 1}^{q}\;\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}}} - {\frac{q}{2}{\log\left( \left( \sigma^{\prime} \right)^{2} \right)}} - {\frac{q}{2}{\log\left( {2{\pi\left( {10^{- 6}\alpha_{m_{i}}} \right)}^{2}} \right)}} + {\sum\limits_{i = 1}^{q}\;{\log\;{p\left( \alpha_{m_{i}} \right)}}}}}} & (15) \end{matrix}$

The derivative of the log-likelihood of the complete data with respect to (σ′)² is given in Equation 16.

$\begin{matrix} {\frac{{\partial\log}\;{p\left( {\beta,\left. m \middle| \alpha \right.,\left( \sigma^{\prime} \right)^{2}} \right)}}{\partial\left( \sigma^{\prime} \right)^{2}} = {{\frac{1}{{2\left\lbrack \left( \sigma^{\prime} \right)^{2} \right\rbrack}^{2}}{\sum\limits_{i = 1}^{q}\;\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}}} - {\frac{q}{2}\frac{1}{\left( \sigma^{\prime} \right)^{2}}}}} & (16) \end{matrix}$

Then, the right-hand side of Equation 16 is plugged into Equation 10 to obtain the first derivative of Q.

$\begin{matrix} {\frac{\partial{Q\left( \left( \sigma^{\prime} \right)^{2} \middle| \left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n} \right)}}{\partial\left( \sigma^{\prime} \right)^{2}} = {{\frac{1}{{2\left\lbrack \left( \sigma^{\prime} \right)^{2} \right\rbrack}^{2}}{\sum\limits_{m \in {\lbrack{1\mspace{11mu}\ldots\mspace{11mu} r}\rbrack}^{q}}^{\;}{\sum\limits_{i = 1}^{\; q}{\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}{p\left( {\left. m \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}}} - \frac{q}{2\left( \sigma^{\prime} \right)^{2}}}} & (17) \end{matrix}$

To determine the value of (σ′)² that maximized Q, the right-hand side of Equation 17 is set to zero and solve for (σ′)². This value is the updated estimate of the normalized error variance.

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\frac{1}{q}{\sum\limits_{m \in {\lbrack{1\mspace{11mu}\ldots\mspace{11mu} r}\rbrack}^{q}}^{\;}{\sum\limits_{i = 1}^{\; q}{\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}{p\left( {\left. m \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}}}} & (18) \end{matrix}$

The multi-dimensional sum in the right-hand side of Equation 18 can be simplified by virtue of the separability of p(m|α,β,└({circumflex over (σ)}′)²┘_(n)).

$\begin{matrix} {{p\left( {\left. m \middle| \alpha \right.,\beta,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)} = {\prod\limits_{i = 1}^{q}\;{p\left( {\left. \alpha_{m_{i}} \middle| \beta_{i} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}} & (19) \end{matrix}$

Next, exchange the order of summation and expand the vector sum in the right-hand side of Equation 18 explicitly.

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\frac{1}{q}{\sum\limits_{i = 1}^{\; q}{\sum\limits_{m_{i} = 1}^{\; r}{{p\left( {\left. \alpha_{m_{1}} \middle| \beta_{1} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}{\sum\limits_{m_{2} = 1}^{r}{{p\left( {\left. \alpha_{m_{2}} \middle| \beta_{2} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}\mspace{14mu}{\quad{\ldots{\quad\;{{\quad\quad}{\sum\limits_{m_{q} = 1}^{r}{{p\left( {\left. \alpha_{m_{q}} \middle| \beta_{q} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}}}}}}}}}}}}}} & (20) \end{matrix}$

Then, rearrange Equation 20, separating each term in the sum as a product of q terms.

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\frac{1}{q}{\sum\limits_{i = 1}^{\; q}{\left( {\sum\limits_{m_{i} = 1}^{\; r}{{p\left( {\left. \alpha_{m_{1}} \middle| \beta_{1} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}\left( \frac{\beta_{i} - \alpha_{m_{i}}}{10^{- 6}\alpha_{m_{i}}} \right)^{2}}} \right).{\prod\limits_{k \neq i}^{\;}\;\left( {\sum\limits_{m_{k} = 1}^{r}{p\left( {\left. \alpha_{m_{k}} \middle| \beta_{k} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}} \right)}}}}} & (21) \end{matrix}$

However, each term in the product indexed by k is the sum of disjoint probabilities and therefore unity. To obtain the form in Equation 8, the index on the inner sum is changed from m_(i) to j.

$\begin{matrix} {\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n + 1} = {\sum\limits_{i = 1}^{\; q}{\sum\limits_{j = 1}^{\; r}{\left( \frac{\beta_{i} - \alpha_{j}}{10^{- 6}\alpha_{\;_{j}}} \right)^{2}{p\left( {\left. \alpha_{j} \middle| \beta_{i} \right.,\left\lbrack \left( {\hat{\sigma}}^{\prime} \right)^{2} \right\rbrack_{n}} \right)}}}}} & (22) \end{matrix}$

EXAMPLE 6 Use of a Spline Algorithm

A spline is a smooth function defined on some domain, consisting of a set of smooth segment functions defined on subdomains that form a partition of the original domain. A spline is formed by concatenation of the segment functions. To obtain a smooth spline, constraints are imposed upon the values of the segment functions and their derivatives at the subdomain boundaries. For a spline to be continuous and have n continuous derivatives requires n+1 constraints at each boundary point.

In data analysis, a model function that best fits the data is chosen from a family of related functions, each indexed by a vector of parameter values. When the parameters represent physical quantities, the model function represents an estimate of the state of a system from a set of measurements.

In some cases, a given physical model is a good description of a process only for disjoint local regions of a domain space. A family of functions can be extended to model a larger class of phenomenon by connecting them to form splines. The domain space (the independent variable) is partitioned into regions, each of which is characterized by its own local set of parameter values. The values of the spline parameters in a subdomain are guided by the measurement values from its own subdomain, but also coupled to the parameter values in other domains by virtue of the spline constraints.

Calibration in FTMS involves generalizing the relationship between the measured cyclotron frequency of an ion and its mass-to-charge ratio from a set of observed frequencies of ions of known mass-to-charge ratios. The form of the calibration function is based upon the magnetic and electrostatic forces encountered by ions in an analytic cell. There are a variety of different calibration functions, but the most widely used involves two parameters, A and B (Ledford, E. B. et al., Mass Calibration, Int J Mass Spectrom Ion Process 56: 2744-2748 (1984))

$\begin{matrix} {{m/z} = {\frac{A}{f_{obs}} + \frac{B}{f_{obs}^{2}}}} & (23) \end{matrix}$

Because the motion of ions in an FTMS cell is not fully understood, the parameter values are semi-empirical. Parameter A corresponds to the centripetal magnetic force and the radial component of the electrostatic trapping force. Parameter B corresponds to the “space-charge effect”.

The space-charge effect describes the electrostatic repulsion between analyte ions of different species, causing a net outward force, and a decrease in frequency. The value of parameter B has been shown to be roughly linear in the total number of ions in the analytic cell (Easterling M. L. et al., Anal Chem 71:624-632 (1999)). However, the space-charge effect is fundamentally a local rather than a global phenomenon, with ions influenced disproportionately more by ions of similar frequency. Therefore, the local spectral density of ions appears to affect the observed frequency. Local distortions in the calibration relation have been reported (Masselon C. et al., JASMS 13: 99-106 (2002)).

Spline parameters may be used to estimate the local variations in the calibration parameters with the ultimate goal of improving the accuracy of the estimated m/z values. The frequency domain is partitioned into regions. The choice of partition is driven by the data. Each subdomain has its own local values of calibration parameters A and B, and an additional parameter D, introduced for technical reasons. The first spline segments has three degree of freedom; each additional spline segment introduces three parameters; two of these are required to satisfy the spline constraints; the remaining degree of freedom can be used to fit the data.

The calibration relation between mass-to-charge-ratio and frequencies in the range [f_(lo), f_(hi)) may be determined using a spline as the calibration relation. Let s denote a spline of N segments defined on this region. Let P=(f₀, f₁, . . . f_(N)) with f₀=f_(lo). f_(N)=f_(hi), and f_(i)<f_(j) for i<j denote a partition of the range [f_(lo),f_(hi)). Let si for i in 1 . . . N denote the segment function defined on the subdomain [f_(i−1),f_(i)). For notational convenience, let l(f) denote the subdomain that contains f. I(f)=i iε[f _(i−1),f_(i))  (24)

Let s(f) denote the value of the spline evaluated at f. This is defined as the value of segment function indexed by l(f) evaluated at f. s(f)=s _(I(f))(f)  (25)

Let A_(i), B_(i) denote the local calibration parameters in [f_(i−1),f_(i)), and let D_(i) denote the local shift applied to this region in order to generate a globally smooth spline.

$\begin{matrix} \begin{matrix} {{s_{i}(f)} = {\frac{A_{i}}{f} + \frac{B_{i}}{f^{2}} + D_{i}}} & {f \in \left\lbrack {f_{i - 1},f_{i}} \right\rbrack} \end{matrix} & (26) \end{matrix}$

Combining Equations 25 and 26, the calibration relation generalized to splines is given by

$\begin{matrix} {{s(f)} = {\frac{A_{I{(f)}}}{f} + \frac{B_{I{(f)}}}{f^{2}} + D_{I{(f)}}}} & (27) \end{matrix}$

Let x denote the vector of 3N parameters, combining the three local parameters for each of the N spline segments. x=[A ₁ B ₁ D ₁ . . . A _(N) B _(N) D _(N)]^(T)  (28)

Equation 27 may be written as a product of a row vector r^(T)(f) and vector x. s(f)=r ^(T)(f)x  (29)

Row vector r^(T)(f) has 3N columns, all but three of which are zero: columns 3l(f)−2, 3l(f)−1, and 3l(f) contain entries 1/f, 1/f2, and 1.

In general, the expression for column i of r^(T)(f) can be expressed as follows:

$\begin{matrix} {{{r^{T}(f)}(i)} = {{\delta\left( {{3\left\lfloor \frac{i + 2}{3} \right\rfloor},{I(f)}} \right)}f^{{3{\lfloor{i/3}\rfloor}} - i}}} & (30) \end{matrix}$

The 2(N−1) constraints on parameter vector x that must be satisfied for s to be a smooth spline can be represented by a matrix Equation. Cx=0  (31)

C denotes a constraint matrix of 2(N−1) rows, one for each constraint, and 3N columns, one for each parameter. For example, the constraint that the spline s be continuous at f₁, requires that the following condition holds:

$\begin{matrix} {{s_{1}\left( f_{1} \right)} = {{\frac{A_{1}}{f_{1}} + \frac{B_{1}}{f_{1}^{2}} + D_{1}} = {{s_{2}\left( f_{1} \right)} = {\frac{A_{2}}{f_{1}} + \frac{B_{2}}{f_{1}^{2}} + D_{2}}}}} & \left( {32\; a} \right) \end{matrix}$

Equivalently, in matrix form,

$\begin{matrix} {{\left. \begin{matrix} \left\lbrack \frac{1}{f_{1}} \right. & \frac{1}{f_{1}^{2}} & 1 & {- \frac{1}{f_{1}}} & {- \frac{1}{f_{1}^{2}}} & {- 1} & 0 & \ldots & 0 \end{matrix} \right\rbrack x} = 0} & \left( {32\; b} \right) \end{matrix}$

The constraint that the first derivative of s be continuous at f₁ requires

$\begin{matrix} {{\frac{\mathbb{d}s_{1}}{\mathbb{d}f}{_{f\; 1}{= {{\frac{- A_{1}}{f_{1}^{2}} - \frac{2B_{1}}{f_{1}^{3}}} = \frac{\mathbb{d}s_{2}}{\mathbb{d}f}}}}_{f\; 1}} = {\frac{- A_{2}}{f_{1}^{2}} - \frac{2B_{2}}{f_{1}^{3}}}} & \left( {33a} \right) \end{matrix}$

Equivalently, in matrix form,

$\begin{matrix} {{\left. \begin{matrix} \left\lbrack \frac{1}{f_{1}^{2}} \right. & \frac{1}{f_{1}^{3}} & 0 & {- \frac{1}{f_{1}^{2}}} & {- \frac{1}{f_{1}^{3}}} & 0 & 0 & \ldots & 0 \end{matrix} \right\rbrack x} = 0} & \left( {33\; b} \right) \end{matrix}$

Let C₁ denote the banded diagonal matrix of N−1 continuity constraints, and C₂ denote the banded diagonal matrix of N−1 first-derivative constraints. Then, C is the matrix formed by stacking C₁ and C₂.

$\begin{matrix} {C = \begin{bmatrix} C_{1} \\ C_{2} \end{bmatrix}} & (34) \end{matrix}$

The general entries (in row i column j) of C₁ and C₂ respectively are given below.

$\begin{matrix} {{C_{1}\left( {i,j} \right)} = {{\delta\left( {{3\left\lfloor \frac{i + 2}{3} \right\rfloor},j} \right)}f_{j}^{{3{\lfloor{i/3}\rfloor}} - i}}} & \left( {35a} \right) \\ {{C_{2}\left( {i,j} \right)} = {{\delta\left( {{3\left\lfloor \frac{i + 2}{3} \right\rfloor},j} \right)}\left( {{3\left\lfloor {i/3} \right\rfloor} - i} \right)f_{j}^{{3{\lfloor{i/3}\rfloor}} - i - 1}}} & \left( {35b} \right) \end{matrix}$

Let f denote the vector whose components are the measured frequencies of K distinct ions. f=[f ^(obs) ₁ . . . f ^(obs) _(K)]^(T)  (36)

Let m denote the vector that contains the corresponding (known) m/z values of these ions. m=[m ₁ . . . m _(K)]^(T)  (37)

Let m^(calc) denote the vector of values calculated from corresponding f^(obs) using the vector of calibration parameters x and the calibration relation in Equation 27. m ^(calc) =[m ^(calc) ₁ . . . m ^(calc) _(K)]^(T)  (38a) m ^(calc) _(i) =S(f ^(obs) _(i))  (38b)

Let e denote the weighted squared error between the observed m/z values and the corresponding calculated values.

$\begin{matrix} {e = {\sum\limits_{k = 1}^{K}\;{w_{k}\left( {m_{k}^{calc} - m_{k}} \right)}^{2}}} & (39) \end{matrix}$

It may be assumed that the errors are normally distributed with the standard error proportional to the mass. Therefore, the weights are given by the inverse mass squared.

$\begin{matrix} {w_{k} = \frac{1}{m_{k}^{2}}} & (40) \end{matrix}$

The goal is to find the parameter vector x that minimizes the e subject to the constraint Cx=0, i.e. the smooth calibration spline that best fits the observed data. Because the log-likelihood is equal to −e (plus some terms that can be ignored because they are independent of x), if x minimizes e it also maximizes the data likelihood.

Because the constraint is linear, the solution to the constrained optimization problem exists in closed form and can be found using the method of Lagrange multipliers.

To construct the solution, Equation 38 may be expressed in matrix form. First the vector m^(calc) may be expressed in terms of a matrix Equation. To do so, matrix R may be constructed by stacking the row vectors defined by Equation 30 evaluated for each observed frequency.

$\begin{matrix} {R = \begin{bmatrix} {r^{T}\left( f_{1}^{obs} \right)} \\ \vdots \\ {r^{T}\left( f_{K}^{obs} \right)} \end{bmatrix}} & (41) \end{matrix}$

Then, combining Equation 41 with Equations 29 and 38ab, the vector m^(calc) is the product of matrix R and parameter vector x. m ^(calc) =Rx  (42)

Next, a diagonal matrix W is defined whose entries are the weights defined in Equation 40. W(i,j)=δ(i,j)w _(j)  (43)

Then, combining Equations 42 and 43 with Equation 39, a matrix expression for the squared error is obtained. e=(Rx−m)^(T) W(Rx−m)  (44) Let X* denote the value of x that minimizes e subject to the constraint Cx=0. x*=(R ^(T) WR)⁻¹ R ^(T) Wm−(R ^(T) WR)⁻¹ C ^(T) [C(R ^(T) WR)⁻¹ C ^(T)]⁻¹ C(R ^(T) WR)⁻¹ R ^(T) Wm  (45)

This is the set of parameters that describe a maximum-likelihood spline relation between observed frequencies and m/z.

When calibration is performed on samples without analytes of known mass-to-charge ratio, the maximum likelihood vector of spline parameters can also be written in terms of Equation 45, except that the matrices W and R and the vector m must be modified.

When an ion mass is not known, its mass is characterized by a probability mass function. For example, suppose that the m_(k) could be any of the following n_(k) values m_(k1), m_(k2), . . . or m_(knk). Suppose also that the probability that the true m/z value is equal to each of these values is p_(k1), p_(k2), . . . and p_(knk) respectively. In the case of uncertain m/z values, the expectation of the squared error is minimized, where the error is taken to be a random variable.

$\begin{matrix} {e = {\sum\limits_{k = 1}^{K}{\sum\limits_{i = 1}^{n_{k}}{p_{ki}{w_{k}\left( {m_{k}^{calc} - m_{ki}} \right)}^{2}}}}} & (46) \end{matrix}$

The term e may be written in matrix form by collapsing the double-sum in Equation 46 into a single sum. The vector m may be constructed as shown in Equation 37, except that each scalar known mass m_(k) may be replaced with the vector of n_(k) candidate mass values (m_(k1), m_(k2), . . . m_(knk)). Likewise, the vector m^(calc) may be constructed as shown in Equation 38a, except that the each scalar calculated mass m^(calc) _(k) may be replaced with a vector containing n_(k) copies of m^(calc) _(k). The diagonal matrix of weights, originally defined, by Equation 43, is similarly modified. In place of each scalar diagonal entry, a block-diagonal matrix is formed, with K blocks denoted by W_(k). W=diag(W _(k))  (47)

The matrix Wk is itself a diagonal matrix with n_(k) entries. Each weight is the product of the inverse mass squared and the candidate probability. W _(k)(i,j)=δ(i,j)p _(ki) w _(k)  (48)

EXAMPLE 7 Calibration Test with Simulated Data Calibration of Tryptic Peptide Mixtures Does not Require Calibration Standards

A simulation experiment was performed to validate a calibration program that used probabilistic peptide identifications rather than known calibrant masses. Peptide masses were selected randomly from a database of human proteome tryptic peptides. A set of ion cyclotron frequencies was calculated from the mass values assuming all peptides had +1 charge and using values for the calibration parameters that are typical for the LTQ-FT. Observed frequencies were simulated by adding random shifts to the calculated frequencies. Calibration errors were introduced by random shifts to the chosen calibration parameter values. For errors of typical size (e.g. 1 ppm), it was possible to recalibrate the spectra without using knowledge of the original mass values, but only that the peptides were randomly selected from the database. To allow discovery of modified peptides, a database of “typical” tryptic peptide chemical formulas was constructed. The database contains the most frequently occurring chemical formulas of fragments that would be generated by tryptic digest of random amino acid sequences.

The data simulation consisted of three parts: selection of peptide masses, conversion of masses to cyclotron frequencies, and introduction of random errors in the frequency values.

The spectrum was driven by the selection of peptide masses at random from a database that contains an in silico tryptic digest of the human proteome. The resulting digest produced 342,623 distinct mass values. Peptide masses were chosen uniformly at random from this list. The number of peptides in the spectrum was a variable parameter.

To ionize a peptide of neutral mass m_(N), the charge z was chosen to be defined by Equation 49. z=┌m _(N)/2000┐  (49)

The mass of the ion m_(l) is the neutral mass plus the mass of z protons. The mass of a proton m_(p) is 1.007276 Da. m _(I) =m _(N) +zm _(p)  (50)

The ideal cyclotron frequency depends upon the mass to charge ratio of the ion. m _(I) /z=(m _(N) +zm _(p))/z=m _(N) /z−m _(p)  (51)

Hereafter, m/z (dropping the subscript l) was used to denote the mass to charge ratio of the ion.

The choice for z placed an upper limit of (approximately) 2,000 on m/z, which is typical for FTMS data collection in proteomic experiments. Each m/z value was converted into an ideal cyclotron frequency. Typically, the calibration relation is defined in terms of the ideal cyclotron frequency for an ion. For example, the common relation was used as shown in Equation 52.

$\begin{matrix} {{m/z} = {\frac{A}{f} + \frac{B}{f^{2}}}} & (52) \end{matrix}$

Note that the second term in the right-hand side of Equation 49 is small compared with the first-term. In some calculations, like analysis of the effect of frequency measurement error upon the mass-to-charge ratio (see below), the following approximation was acceptable.

$\begin{matrix} {{m/z} \cong \frac{A}{f}} & (53) \end{matrix}$

Equation 54 has two solutions.

$\begin{matrix} {f = {\frac{A}{2\left( {m/z} \right)} \pm \frac{\sqrt{A^{2} + {4{B\left( {m/z} \right)}}}}{2\left( {m/z} \right)}}} & (54) \end{matrix}$

The smaller of the two frequencies is the magnetron frequency. The larger value was desired, the cyclotron frequency, which is slightly smaller than A/(m/z). The values for A and B of 1.075*10⁸ and −3.455*10⁸ were chosen respectively. These values approximate typical values for the Thermo LTQ-FT. Using these calibration parameters, each m/z value was plugged into Equation 54 to generate an ideal cyclotron frequency. These values are referred to as A_(true) and B_(true). The values of A_(true) and B_(true) were not available to the calibration program that subsequently analyzed the simulated data. The ideal frequency generated from Equation 54 will be referred to as f_(true).

A mean-zero Gaussian random variable was added to each cyclotron frequency to simulate additive measurement error, denoted by e in Equation 55. The resulting frequency was denoted by f_(obs). f _(obs) =f _(true) +e  (55)

The standard deviation of the random error e was set to be proportional to the true frequency.

$\begin{matrix} {\sigma_{e} = {\frac{x}{10^{6}}f_{true}}} & (56) \end{matrix}$

The term x denoted the measurement error in parts-per-million (ppm). Note that a given ppm error in the frequency produces an approximately equivalent ppm error in mass, as can be derived by differentiating both sides of (53).

$\begin{matrix} {\frac{d\left( {m/z} \right)}{\left( {m/z} \right)} \cong \frac{df}{f}} & (57) \end{matrix}$

The error in this approximation is insignificant for typical calibration parameters. The simulated data consisted of a set of “observed” cyclotron frequencies, generated as described above. The number of observed frequencies was a variable parameter, which was denoted by N. The performance of the algorithm depended upon N as described below.

In addition to the parameters controlling the data simulation, there were a number of parameters that controlled the algorithm. The most important of these was the initial estimates of the calibration parameters A and B. These initial estimates are denoted by A₀ and B₀ respectively. In practice, these parameters may be the last known calibration parameters for the machine—either the output of the algorithm on the previous scan or the result of calibration on a previous run,

In testing the algorithm, the chosen values differed slightly from the true values of A and B described above to simulate realistic errors in calibration. Analysis may be helpful in determining how to appropriately miscalibrate spectra.

Consider the effect of errors in both A and B upon m/z by modifying Equation 52.

$\begin{matrix} {{\Delta\left( {m/z} \right)} = {\frac{\Delta\; A}{f} + \frac{\Delta\; B}{f^{2}}}} & (58) \end{matrix}$

Setting Δ(m/z) to zero and solving for ΔB indicates that the calibration error will be equal to zero for some value of f. Let f₀ denote the value where the calibration error is zero. ΔB=−ΔA(f ₀)  (59)

Combining Equations 58 and 59, produces an Equation for the calibration error in m/z as a function of ΔA and f₀.

$\begin{matrix} {{\Delta\left( {m/z} \right)} = {\frac{\Delta\; A}{f}\left\lbrack {1 - \frac{f_{0}}{f}} \right\rbrack}} & (60) \end{matrix}$

Combining Equation 60 with (53), produces an approximation for the normalized calibration error.

$\begin{matrix} {\frac{\Delta\left( {m/z} \right)}{\left( {m/z} \right)} \cong {\frac{\Delta\; A}{A}\left\lbrack {1 - \frac{f_{0}}{f}} \right\rbrack}} & (61) \end{matrix}$

The root-mean-squared normalized calibration error in a spectrum with observed frequencies (f1 . . . fN) can be approximated from (61). Replacing the true frequencies with the observed frequencies should not significantly change our estimate.

$\begin{matrix} {{{rms}\left\lbrack \frac{\Delta\left( {m/z} \right)}{\left( {m/z} \right)} \right\rbrack} \cong {\frac{\Delta\; A}{A}\sqrt{\sum\limits_{i = 1}^{N}\left\lbrack {1 - \frac{f_{0}}{f_{i}}} \right\rbrack^{2}}}} & (62) \end{matrix}$

The error is minimized when f₀ is chosen to be the reciprocal average of the reciprocal frequency. This value of f₀, denoted by f₀* in Equation 59, eliminates systematic calibration errors in a given spectrum.

$\begin{matrix} {f_{0}^{*} = \left( {\frac{1}{N}{\sum\limits_{i = 1}^{N}\frac{1}{f_{i}}}} \right)^{- 1}} & (63) \end{matrix}$

The first six parameters describe the generation of simulated data. The values of A_(true) and B_(true) are typical calibration parameters that have been have encountered when running the Thermo LTQ-FT. The values of A_(init) and B_(init) were chosen to introduce miscalibration. A_(init) differed from A_(true) by 2 ppm. From Equation 55, it was observed that introduced calibration errors bounded above by 2 ppm for large masses. The value of B_(init) was chosen so that f₀ (Equation 55) would be near the center of the spectrum. This combination of A_(0init) and B_(0init) placed the zero point for the calibration at m/z ˜2000.

The number of peaks was arbitrarily set to 50 to represent a typical mass spectrum. The algorithm may perform better given more peaks. The measurement error describes the normalized rms deviation between the true cyclotron frequency and the observed value.

The last three parameters governed the calibration algorithm. In the above example, the initial error estimate was intentionally chosen to be much larger than the actual error. The number of iterations for the error estimator and calibrator were chosen to be much larger than what is typically required for convergence.

The algorithm proved to be robust to a variety of conditions. The data are shown in FIG. 5. In the high mass region inset of FIG. 5, the true masses lie on the x-axis. The first dashed vertical line denotes a low-confidence identification because several candidates are within ±1σ of the true mass value. The second dotted line denotes a high-confidence identification because there is only one candidate within ±1σ of the true mass value. There were no candidates in ±1σ. In summary, 50 random human tryptic peptides were analyzed (m=[0,2000], z=1).

The parameters characterizing the simulated data were the number of peptides in the spectrum and the measurement error. The performance of the calibration algorithm would be expected to increase with the number of peptides. This is because the initial convergence of the algorithm depends upon being able to unambiguously identify at least a small number of peptide masses. The probability that this condition is satisfied increases exponentially with the number of peptides in the spectrum. Similarly, the performance of the algorithm would be inversely correlated with the size of the measurement error. Large errors may make it difficult to identify peptide masses.

While the description above refers to particular embodiments of the present invention, it should be readily apparent to people of ordinary skill in the art that a number of modifications may be made without departing from the spirit thereof. The presently disclosed embodiments are, therefore, to be considered in all respects as illustrative and not restrictive. 

1. A method of producing a calibrated mass spectrum, comprising: a) providing a sample comprising two or more analytes; b) subjecting the sample to mass spectrometry to obtain a mass spectrum, wherein the mass spectrum comprises un-calibrated data; c) extracting the peaks from the spectrum and assigning a position and an ion charge to each peak; d) providing input parameters comprising: (i) initial estimates of calibration parameters wherein the calibration parameters relate the observed peaks in the mass spectrum to mass-to-charge ratio; and (ii) initial estimate of root-mean-squared error in the calibrated mass values; e) providing a list of masses of analytes from a database to provide candidate analytes present in the sample, wherein a database comprises a list of elemental compositions and corresponding mass values; f) converting each peak position determined in step (c) to an estimated mass-to-charge ratio using the input parameters; g) calculating an estimated mass of the neutral analyte molecule from the mass-to-charge ratio estimate in step (f) and the ion charge determined in step (c); h) assigning probabilities to one or more entries in the database as the identity of the analyte based on the estimate of the mass from step (g) and the estimate of root-mean-squared error; i) updating the estimated values of the calibration parameters based on the assigned probabilities in step (h); j) updating the estimated root-mean-squared error using the updated calibration parameters from step (i); and k) repeating steps f) through i) until convergence is reached, whereby a calibrated mass spectrum is produced and candidate identities are assigned to each peak in the spectrum.
 2. The method of claim 1, wherein the input parameters further comprise, updated calibration parameters, an updated estimate of root-mean-squared or combinations thereof.
 3. The method of claim 1, wherein the mass spectrometry is Fourier transform mass spectrometry.
 4. The method of claim 1, wherein the mass spectrometry output comprises cyclotron frequencies.
 5. The method of claim 1, wherein the elemental composition probabilities are peptide probabilities.
 6. The method of claim 1, wherein the sample is selected from the group consisting of blood, plasma, serum, spinal fluid, urine, sweat, saliva, tears, breast aspirate, prostate fluid, seminal fluid, vaginal fluid, stool, cervical scraping, cytes, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysates, tumor tissue, hair, skin, buccal scrapings, nails, bone marrow, cartilage, prions, bone powder, ear wax, and combinations thereof.
 7. The method of claim 1, wherein the elemental composition comprises at least one peptide.
 8. The method of claim 1, wherein the sample is selected from the group consisting of hydrocarbons, petroleum products, nucleotides, combinatorial samples, polymeric samples, and combinations thereof.
 9. The method of claim 1, wherein the sample is a petroleum product.
 10. The method of claim 1, wherein the estimating the root-mean-squared error and elemental composition probabilities comprises using an Expectation Maximization algorithm.
 11. The method of claim 1, wherein the estimating the root-mean-squared error and elemental composition probabilities comprises using a spline algorithm.
 12. A mass spectrometry calibration system, comprising: A) a mass spectrometry device to analyze a sample and produce a mass spectrometry output, wherein said mass spectrometry output comprises un-calibrated data, and wherein the sample does not comprise a specific calibrant; and B) calibration software configured to: i) receive input parameters, and wherein the input parameters comprise (a) initial estimates of calibration parameters wherein the calibration parameters relate the observed peaks in the mass spectrum to mass-to-charge ratio; and (b) initial estimate of root-mean-squared error in the calibrated mass values, ii) receive a list of exact masses of analytes from a database to provide candidate analytes present in the sample, wherein a database comprises a list of elemental compositions and corresponding mass values iii) convert each peak position to an estimated mass-to-charge ratio using the input parameters, iv) calculate an estimated mass of the neutral analyte molecule from the mass-to-charge ratio estimate and the ion charge, v) assign probabilities to one or more entries in the database as the identity of the analyte based on the estimate of the mass and the estimate of root-mean-squared error (vi) update the estimated values of the calibration parameters based on the assigned probabilities; (vii) update the estimated root-mean-squared error using the updated calibration parameters; and vi) repeat steps iii) through vii) until convergence is reached, whereby a calibrated mass spectrum is produced and candidate identities are assigned to each peak in the spectrum.
 13. The system of claim 12, wherein the input parameters are selected from the group consisting of initial calibration parameters, an initial root-mean-squared error estimate, updated calibration parameters, an updated root-mean-squared error estimate, and combinations thereof.
 14. The system of claim 12, wherein the mass spectrometry device is a Fourier transform mass spectrometer.
 15. The system of claim 12, wherein the mass spectrometry output comprises cyclotron frequencies.
 16. The system of claim 12, wherein the elemental composition probabilities are peptide probabilities.
 17. The system of claim 12, wherein the sample is selected from the group consisting of blood, plasma, serum, spinal fluid, urine, sweat, saliva, tears, breast aspirate, prostate fluid, seminal fluid, vaginal fluid, stool, cervical scraping, cytes, amniotic fluid, intraocular fluid, mucous, moisture in breath, animal tissue, cell lysates, tumor tissue, hair, skin, buccal scrapings, nails, bone marrow, cartilage, prions, bone powder, ear wax, and combinations thereof.
 18. The system of claim 12, wherein the sample comprises at least one peptide.
 19. The system of claim 12, wherein the sample is selected from the group consisting of hydrocarbon ns, petroleum products, nucleotides, combinatorial samples, polymeric samples, and combinations thereof.
 20. The system of claim 12, wherein the sample is a petroleum product.
 21. The system of claim 12, wherein the software is configured to estimate the root-mean-squared error and the elemental composition probabilities using an Expectation Maximization algorithm.
 22. The system of claim 12, wherein the software is configured to estimate the root-mean-squared error and the elemental composition probabilities using a spline algorithm.
 23. A computer-readable medium having computer-executable instructions that when executed perform a method, the method comprising: a) converting a mass spectrum comprising un-calibrated data to mass values using input parameters, b) extracting the peaks from the spectrum and assigning a position and an ion charge to each peak; c) providing input parameters comprising: (i) initial estimates of calibration parameters wherein the calibration parameters relate the observed peaks in the mass spectrum to mass-to-charge ratio; and (ii) initial estimate of root-mean-squared error in the calibrated mass values; d) providing a list of exact masses of analytes from a database to provide candidate analytes present in the sample, wherein a database comprises a list of elemental compositions and corresponding mass values; e) converting each peak position determined in step (b) to an estimated mass-to-charge ratio using the input parameters; f) calculating an estimated mass of the neutral analyte molecule from the mass-to-charge ratio estimate in step (e) and the ion charge determined in step (b); g) assigning probabilities to one or more entries in the database as the identity of the analyte based on the estimate of the mass from step (f) and the estimate of root-mean-squared error; h) updating the estimated values of the calibration parameters based on the assigned probabilities in step (g); i) updating the estimated root-mean-squared error using the updated calibration parameters from step (h); and j) repeating steps e) through i) until convergence is reached, whereby a calibrated mass spectrum is produced and candidate identities are assigned to each peak in the spectrum.
 24. The computer-readable medium of claim 23, wherein the input parameters are selected from the group consisting of initial calibration parameters, an initial root-mean-squared error estimate, and combinations thereof.
 25. The computer-readable medium of claim 23, wherein the estimating the root-mean-squared error and the elemental composition probabilities uses an Expectation Maximization algorithm.
 26. The computer-readable medium of claim 23, wherein the estimating the root-mean-squared error and the elemental composition probabilities uses a spline algorithm.
 27. The computer-readable medium of claim 23, wherein the mass spectrometry output is produced by a Fourier transform mass spectrometer.
 28. The computer-readable medium of claim 23, wherein the mass spectrometry output comprises cyclotron frequencies.
 29. The computer-readable medium of claim 23, wherein the elemental composition probabilities are peptide probabilities. 